This is the title

Tony Liang

University of British Columbia

December 31, 2023

About myself

  • PhD student in Bioinformatics under Dr. Amrit Singh supervision

  • BSc. in Math + minor in Data Science

  • What does “bioinformatician” do?

    • Assist researchers like you to better understand what your data means
    • We’re just coders that know little bit more bio
  • Currently working in creating pipeline, tools, models analyzing biological data in an automated-fashion

    • Focused on machine learning & AI
    • Reproducible workflows

Single Cell RNA-seq

What is single cell RNA sequencing?

Loading data

The data is from a study (Kang et al. 2018) and publicly avaliable through R’s ExperimentHub function

eh <- ExperimentHub() #  Initialize the hub as some list object
sce <- eh[["EH2259"]] # We could then extract the match entry by taking this entry from out hub
# Then print it
sce
class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • After loaded data should inspect basic information
  • What are rows? column? size of data?
  • Rows = genes
  • Columns = cells

Preprocessing of data before analysis

The data retrieved is rawest form, not all of it is suitable for analysis.

Caveats:

  • Undetected genes
  • Cells with very few or many detected genes
  • Lowly expressed genes
  • unnormalized expression values

This is usually the quality control (QC) step. This could potentially be another tutorial, so not deeply covered today.

We only perfom simple actions

Remove undetected genes

sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce
class: SingleCellExperiment 
dim: 18890 29065 
metadata(0):
assays(1): counts
rownames(18890): RP11-34P13.8 AL627309.1 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Reduced from \(35635\) genes to \(18890\), nearly 16K genes that were not detected in any cells

Remove cells with few or many detected genes

# We need additional support to compute per cell quality control metrics from scater package
qc <- scater::perCellQCMetrics(sce)
# remove cells with few or many detected genes
ol <- scater::isOutlier(metric = qc$detected, nmads = 2, log = TRUE)
# Then we remove those are consider as outlier
sce <- sce[, !ol] # This means retain column that are not ol
dim(sce)
[1] 18890 26820
  • Now we changed 29065 cells to 26820 cells, where these cells are either overly abundant or too few

Remove lowly expressed genes

# Similar to early, see pattern now with rowSums
# But we want to at least have 10 cells that high expression value, this threshold could change
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(1): counts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):
  • Interesting now, we only retained 7118 genes from original 35K and 26820 cells from original 29K.
  • This would be our final “filtered data”, but not entirely ready for analysis
    • Need to normalize these expression values

Normalize expression values

Calculate a log2-transformed normalized expression values:

  • dividing each count by its size factor
  • adding pseudo count of 1
  • log transforming
# compute sum-factors & normalize
sce <- scater::computeLibraryFactors(sce)
sce <- scater::logNormCounts(sce)
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(0):
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(6): ind stim ... multiplets sizeFactor
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

This is our finalized data that have gone through series of QC steps. Now, we move to muscat

MUSCAT

multi-sample multi-group scRNA-seq analysis tools (Crowell et al. 2020)

It expects a SCE and requires cell metadata columns to have:

  • sample_id : sample identifier i.e. Nautilus_trt_3
  • cluster id: subpopulation (cluster assignment) i.e. T cells, monocytes
  • group id: experimental group/condition i.e. control/treatment, healthy/diseased
sce$id <- paste0(sce$stim, sce$ind)
sce <- muscat::prepSCE(sce, 
    kid = "cell", # subpopulation assignments
    gid = "stim",  # group IDs (ctrl/stim)
    sid = "id",   # sample IDs (ctrl/stim.1234)
    drop = TRUE)  # drop all other colData columns
sce
class: SingleCellExperiment 
dim: 7118 26820 
metadata(1): experiment_info
assays(2): counts logcounts
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(26820): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(3): cluster_id sample_id group_id
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

MUSCAT

Sample level analysis (pseudobulk)

Cell level analysis (Important, what we care, but limited)

  • Fit linear mixed model for each gene to cell level measurement data

Under the hood, these involves statistics modelling, mainly linear regression in high dimension setting.

Sample level

class: SingleCellExperiment 
dim: 7118 16 
metadata(2): experiment_info agg_pars
assays(8): B cells CD14+ Monocytes ... Megakaryocytes NK cells
rownames(7118): NOC2L HES4 ... S100B PRMT2
rowData names(2): ENSEMBL SYMBOL
colnames(16): ctrl101 ctrl1015 ... stim1256 stim1488
colData names(1): group_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

Sample level – model fitting

   sample_id group_id n_cells
1    ctrl107     ctrl     586
2   ctrl1016     ctrl    2062
3   ctrl1256     ctrl    2267
4   ctrl1488     ctrl    2229
5   ctrl1015     ctrl    2880
6    ctrl101     ctrl     912
7   ctrl1244     ctrl    2091
8   ctrl1039     ctrl     436
9    stim101     stim    1197
10  stim1488     stim    2779
11  stim1244     stim    1642
12  stim1256     stim    2127
13  stim1015     stim    2494
14  stim1016     stim    1901
15  stim1039     stim     639
16   stim107     stim     578
[1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
[4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
[7] "Megakaryocytes"    "NK cells"         

Sample level – between cluster concordance

  • which genes are DE in only a single (or very few) clusters?
  • How many DE genes are shared between clusters?

Sample level – pseudobulk heatmap top N DS genes per cluster

  • Good when wanting to gain an overview of numerous DE testing results for many clusters

Sample level – pseudobulk heatmap top N DS genes for single cluster

Moreover, it provides a set of options regarding which cluster(s), gene(s), and comparison to include (arguments k, g and c, respectively).

Sample level – pseudobulk heatmap for single gene all clusters

  • Also could visualize cluster-sample means of a single gene of interest across all clusters
    • Identify cell-types that are affected similarly by different experimental conditions

Cell level

  • For changes of high interest, we can view the cell-level expression profiles of a specific gene across samples or groups

Conclusion

  • scRNA-seq data is messy by nature — preprocessing is essential before any analysis.
  • We used scater for:
    • Removing undetected/low-quality genes & cells
    • Normalizing expression values
  • MUSCAT provides series of convenient functions to analyze scRNA-seq data
    • Although it focuses on bulk-level by aggregating them cells

Thanks!

  • Dr. Amrit Singh
  • Dr. Young Woong Kim
  • Dr. Maryam Ahmadzadeh
  • Rishika Daswani
  • Roy He
  • Michael Yoon
  • Jeffrey Tang
  • Akshdeep Sandhu
  • Yovindu Don
  • Raam Sivakumar
  • Prabhleen Sandhu
  • Mingming Zhang
  • Samuel Leung

Reference

Crowell, Helena L, Charlotte Soneson, Pierre-Luc Germain, Daniela Calini, Ludovic Collin, Catarina Raposo, Dheeraj Malhotra, and Mark D Robinson. 2020. “Muscat Detects Subpopulation-Specific State Transitions from Multi-Sample Multi-Condition Single-Cell Transcriptomics Data.” Nature Communications 11 (1): 6077.
Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell RNA-Sequencing Using Natural Genetic Variation.” Nature Biotechnology 36 (1): 89–94.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15: 1–21.
Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.